8 research outputs found
Instruct-Align: Teaching Novel Languages with to LLMs through Alignment-based Cross-Lingual Instruction
Instruction-tuned large language models (LLMs) have shown remarkable
generalization capability over multiple tasks in multiple languages.
Nevertheless, their generalization towards different languages varies
especially to underrepresented languages or even to unseen languages. Prior
works on adapting new languages to LLMs find that naively adapting new
languages to instruction-tuned LLMs will result in catastrophic forgetting,
which in turn causes the loss of multitasking ability in these LLMs. To tackle
this, we propose the Instruct-Align a.k.a (IA) framework, which enables
instruction-tuned LLMs to learn cross-lingual alignment between unseen and
previously learned languages via alignment-based cross-lingual
instruction-tuning. Our preliminary result on BLOOMZ-560M shows that (IA)
is able to learn a new language effectively with only a limited amount of
parallel data and at the same time prevent catastrophic forgetting by applying
continual instruction-tuning through experience replay. Our work contributes to
the progression of language adaptation methods for instruction-tuned LLMs and
opens up the possibility of adapting underrepresented low-resource languages
into existing instruction-tuned LLMs. Our code will be publicly released upon
acceptance
InstructTODS: Large Language Models for End-to-End Task-Oriented Dialogue Systems
Large language models (LLMs) have been used for diverse tasks in natural
language processing (NLP), yet remain under-explored for task-oriented dialogue
systems (TODS), especially for end-to-end TODS. We present InstructTODS, a
novel off-the-shelf framework for zero-shot end-to-end task-oriented dialogue
systems that can adapt to diverse domains without fine-tuning. By leveraging
LLMs, InstructTODS generates a proxy belief state that seamlessly translates
user intentions into dynamic queries for efficient interaction with any KB. Our
extensive experiments demonstrate that InstructTODS achieves comparable
performance to fully fine-tuned TODS in guiding dialogues to successful
completion without prior knowledge or task-specific data. Furthermore, a
rigorous human evaluation of end-to-end TODS shows that InstructTODS produces
dialogue responses that notably outperform both the gold responses and the
state-of-the-art TODS in terms of helpfulness, informativeness, and humanness.
Moreover, the effectiveness of LLMs in TODS is further supported by our
comprehensive evaluations on TODS subtasks: dialogue state tracking, intent
classification, and response generation. Code and implementations could be
found here https://github.com/WillyHC22/InstructTODS
Survey of Social Bias in Vision-Language Models
In recent years, the rapid advancement of machine learning (ML) models,
particularly transformer-based pre-trained models, has revolutionized Natural
Language Processing (NLP) and Computer Vision (CV) fields. However, researchers
have discovered that these models can inadvertently capture and reinforce
social biases present in their training datasets, leading to potential social
harms, such as uneven resource allocation and unfair representation of specific
social groups. Addressing these biases and ensuring fairness in artificial
intelligence (AI) systems has become a critical concern in the ML community.
The recent introduction of pre-trained vision-and-language (VL) models in the
emerging multimodal field demands attention to the potential social biases
present in these models as well. Although VL models are susceptible to social
bias, there is a limited understanding compared to the extensive discussions on
bias in NLP and CV. This survey aims to provide researchers with a high-level
insight into the similarities and differences of social bias studies in
pre-trained models across NLP, CV, and VL. By examining these perspectives, the
survey aims to offer valuable guidelines on how to approach and mitigate social
bias in both unimodal and multimodal settings. The findings and recommendations
presented here can benefit the ML community, fostering the development of
fairer and non-biased AI models in various applications and research endeavors
Contrastive Learning for Inference in Dialogue
Inference, especially those derived from inductive processes, is a crucial
component in our conversation to complement the information implicitly or
explicitly conveyed by a speaker. While recent large language models show
remarkable advances in inference tasks, their performance in inductive
reasoning, where not all information is present in the context, is far behind
deductive reasoning. In this paper, we analyze the behavior of the models based
on the task difficulty defined by the semantic information gap -- which
distinguishes inductive and deductive reasoning (Johnson-Laird, 1988, 1993).
Our analysis reveals that the disparity in information between dialogue
contexts and desired inferences poses a significant challenge to the inductive
inference process. To mitigate this information gap, we investigate a
contrastive learning approach by feeding negative samples. Our experiments
suggest negative samples help models understand what is wrong and improve their
inference generations.Comment: Accepted to EMNLP202
Prompting Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages
While code-mixing is a common linguistic practice in many parts of the world,
collecting high-quality and low-cost code-mixed data remains a challenge for
natural language processing (NLP) research. The proliferation of Large Language
Models (LLMs) in recent times compels one to ask: can these systems be used for
data generation? In this article, we explore prompting LLMs in a zero-shot
manner to create code-mixed data for five languages in South East Asia (SEA) --
Indonesian, Malay, Chinese, Tagalog, Vietnamese, as well as the creole language
Singlish. We find that ChatGPT shows the most potential, capable of producing
code-mixed text 68% of the time when the term "code-mixing" is explicitly
defined. Moreover, both ChatGPT and InstructGPT's (davinci-003) performances in
generating Singlish texts are noteworthy, averaging a 96% success rate across a
variety of prompts. The code-mixing proficiency of ChatGPT and InstructGPT,
however, is dampened by word choice errors that lead to semantic inaccuracies.
Other multilingual models such as BLOOMZ and Flan-T5-XXL are unable to produce
code-mixed texts altogether. By highlighting the limited promises of LLMs in a
specific form of low-resource data generation, we call for a measured approach
when applying similar techniques to other data-scarce NLP contexts
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages
Democratizing access to natural language processing (NLP) technology is
crucial, especially for underrepresented and extremely low-resource languages.
Previous research has focused on developing labeled and unlabeled corpora for
these languages through online scraping and document translation. While these
methods have proven effective and cost-efficient, we have identified
limitations in the resulting corpora, including a lack of lexical diversity and
cultural relevance to local communities. To address this gap, we conduct a case
study on Indonesian local languages. We compare the effectiveness of online
scraping, human translation, and paragraph writing by native speakers in
constructing datasets. Our findings demonstrate that datasets generated through
paragraph writing by native speakers exhibit superior quality in terms of
lexical diversity and cultural content. In addition, we present the
\datasetname{} benchmark, encompassing 12 underrepresented and extremely
low-resource languages spoken by millions of individuals in Indonesia. Our
empirical experiment results using existing multilingual large language models
conclude the need to extend these models to more underrepresented languages. We
release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
We present NusaCrowd, a collaborative initiative to collect and unify
existing resources for Indonesian languages, including opening access to
previously non-public resources. Through this initiative, we have brought
together 137 datasets and 118 standardized data loaders. The quality of the
datasets has been assessed manually and automatically, and their value is
demonstrated through multiple experiments. NusaCrowd's data collection enables
the creation of the first zero-shot benchmarks for natural language
understanding and generation in Indonesian and the local languages of
Indonesia. Furthermore, NusaCrowd brings the creation of the first multilingual
automatic speech recognition benchmark in Indonesian and the local languages of
Indonesia. Our work strives to advance natural language processing (NLP)
research for languages that are under-represented despite being widely spoken
ASCEND: A Spontaneous Chinese-English Dataset for Code-switching in Multi-turn Conversation
Code-switching is a speech phenomenon occurring when a speaker switches
language during a conversation. Despite the spontaneous nature of
code-switching in conversational spoken language, most existing works collect
code-switching data from read speech instead of spontaneous speech. ASCEND (A
Spontaneous Chinese-English Dataset) is a high-quality Mandarin Chinese-English
code-switching corpus built on spontaneous multi-turn conversational dialogue
sources collected in Hong Kong. We report ASCEND's design and procedure for
collecting the speech data, including annotations. ASCEND consists of 10.62
hours of clean speech, collected from 23 bilingual speakers of Chinese and
English. Furthermore, we conduct baseline experiments using pre-trained wav2vec
2.0 models, achieving a best performance of 22.69\% character error rate and
27.05% mixed error rate